ABSTRACT


This paper focuses on the nature of representations in connectionist models.
It addresses two issues: (1) Can connectionist models develop representations
which possess internal structure and which provide the basis for productive
and systematic behavior, and (2) Can representations which are fundamentally
context-sensitive support grammatical behavior which appears to be
abstract and general? Results from two simulations are reported. The
simulations address problems in the distinction between type and token, the
representation of lexical categories, and the representation of grammatical
structure. The results suggest that connectionist representations can indeed
possess internal structure and enable systematic behavior, and that a mechanism
which is sensitive to context is capable of capturing generalizations of varying
degrees of abstractness.


INTRODUCTION 

Connectionist  models  appear  to  pro¬ 
vide  a  new  and  different  framework  for  un¬ 
derstanding  cognition.  It  is  therefore  natural 
to  wonder  how  these  models  might  differ 
from  traditional  theories,  and  what  their  ad¬ 
vantages  or  disadvantages  might  be.  Re¬ 
cent  discussion  has  focussed  on  a  number  of 
topics,  including  the  treatment  of  regular  and 
productive  behavior  (rules  vs.  analogy),  the 
form  of  knowledge  (explicit  vs.  implicit),  the 
ontogeny  of  knowledge  (innate  vs.  ac¬ 
quired),  and  the  nature  of  connectionist  rep¬ 
resentations. 

This  latter  issue  is  particularly  impor¬ 
tant  because  one  of  the  critical  ways  in 
which  cognitive  theories  may  differ  is  in  the 


representational  apparatus  they  make  avail¬ 
able.  Our  current  understanding  of  connec¬ 
tionist  representations  is  at  best  partial,  and 
there  is  considerable  diversity  of  opinion 
among  those  who  are  actively  exploring  the 
topic (cf. Dolan & Dyer, 1987; Dolan &
Smolensky, 1988; Feldman & Ballard, 1982;
Hanson & Burr, 1987; McMillan & Smolensky,
1988; Hinton, 1988; Hinton, McClelland, &
Rumelhart, 1986; McClelland, St. John, &
Taraban, 1989; Pollack, 1988; Ramsey, 1989;
Rumelhart, Hinton, & Williams, 1986; Shastri &
Ajjanagadde, 1989; Smolensky, 1987a, 1987b,
1987c, 1988; Touretzky & Hinton, 1985;
Touretzky, 1986, 1989; van Gelder, in press).

In  this  paper  I  would  like  to  focus  on 
some of the specific questions raised by
Fodor & Pylyshyn (1988). Fodor & Pylyshyn
express concern that whereas Classical
theories (e.g., the Language of Thought,
Fodor,  1976)  are  committed  to  complex 
mental  representations  which  reflect  combi¬ 
natorial  structure,  connectionist  representa¬ 
tions  seem  to  be  atomic,  and  therefore 
(given  the  limited  and  fixed  resources  avail¬ 
able  to  them)  finite  in  number.  And  this  ap¬ 
pears  to  be  at  odds  with  what  we  believe  to 
be  necessary  for  human  cognition  in  general, 
and  human  language  in  particular. 

I  believe  that  Fodor  and  Pylyshyn 
are  right  in  stressing  the  need  for  represen¬ 
tations  which  support  complex  and  system¬ 
atic  patterning,  which  reflect  both  the  com¬ 
binatorics  and  compositionality  of  thought, 
and which enable open-ended productions.
What their analysis does not make
self-evident  is  that  these  desiderata  can 
only  be  achieved  by  the  so-called  Classical 
theories,  or  by  connectionist  models  which 
implement those theories. Fodor & Pylyshyn
present a regrettably simplistic picture
of current linguistic theory. What they call
the Classical theory actually encompasses a
heterogeneous set of theories, not all of
which are obviously compatible with the
Language of Thought. Furthermore, there
have in recent years been well-articulated
linguistic theories which do not share the
basic premises of the Language of Thought
(e.g., Chafe, 1970; Fauconnier, 1985;
Fillmore, 1982; Givon, 1984; Hopper &
Thompson, 1980; Kuno, 1987; Lakoff, 1987;
Langacker, 1987). Thus the two alternatives
presented by Fodor and Pylyshyn (that
connectionism must either implement the
Language of Thought or fail as a cognitive
model) are unnecessarily bleak and do not
exhaust the range of possibilities.

Still,  it  is  possible  to  phrase  the 
questions  posed  by  Fodor  &  Pylyshyn  in  a 
more general way which might be profitably
pursued: What is the nature of connectionist
representations? Are they necessarily
atomistic or can they possess internal
structure? Can that structure be used to account
for behavior which reflects both general and
idiosyncratic patterning? Can connectionist
representations  with  finite  resources  pro¬ 
vide  an  account  for  apparently  open-ended 
productive  behavior?  How  might  connec¬ 
tionist  representations  differ  from  those  in 
the  Language  of  Thought?  One  strength  of 
connectionist models that is often emphasized
is their sensitivity to context and ability
to exhibit graded responses to subtle
differences in stimuli (e.g., McClelland,
St. John, & Taraban, 1989). But sometimes
language  behavior  seems  to  be  character¬ 
ized  by  abstract  patterns  which  are  less 
sensitive  to  context.  So  another  question  is 
whether  models  which  are  fundamentally 
context-sensitive  are  also  able  to  arrive  at 
generalizations  which  are  highly  abstract. 

In  this  paper  I  present  results  from 
two  sets  of  simulations.  These  simulations 
were  designed  to  probe  the  above  issues, 
with  the  goal  of  providing  some  insight  into 
the  representational  capacity  of  connection¬ 
ist  models.  The  paper  is  organized  in  two 
sections.  The  first  section  reports  empirical 
results.  Two  connectionist  networks  were 
taught  tasks  in  which  an  abstract  structure 
underlay  the  stimuli  and  task.  The  intent 
was to create problems which would encourage
the development of internal representations
which reflected that abstract structure.
Both the performance of the networks
and the analysis of their solutions
illustrate the development of internal
representations which are richly structured.
These results are discussed at greater
length in the second section, and related to
the broader question of the usefulness of
the connectionist framework for modeling
cognitive phenomena, and possible differences
from the Classical approach.

Part I: SIMULATIONS

Language  is  structured  in  a  number 
of  ways.  One  important  kind  of  structure 
has  to  do  with  the  structure  of  the  catego¬ 
ries  of  language  elements  (e.g.,  words). 
The  first  simulation  addressed  the  question 
of  whether  a  connectionist  model  can  induce 
the  lexical  category  structure  underlying  a 
set  of  stimuli.  A  second  way  in  which  lan¬ 
guage  is  structured  has  to  do  with  the  possi¬ 
ble  ways  in  which  strings  can  be  combined 
(e.g.,  the  grammatical  structure).  The  sec¬ 
ond  simulation  addresses  that  issue. 

LEXICAL  CATEGORY  STRUCTURE 

Words  may  be  categorized  with  re¬ 
spect  to  many  factors.  These  include  tradi¬ 
tional  notions  such  as  noun,  verb,  etc.;  the 
argument  structure  they  are  associated 
with;  and  their  semantic  features.  One  of 
the  consequences  of  lexical  category  struc¬ 
ture  is  word  order.  Not  all  classes  of  words 
may  appear  in  any  position.  Furthermore, 
certain  classes  of  words,  e.g.,  transitive 
verbs,  tend  to  cooccur  with  other  words  (as 
we  shall  see  in  the  next  simulation,  these 
cooccurrence  facts  can  be  quite  complex). 

The goal of the first simulation was
to see if a network could learn the lexical
category structure which was implicit in a
language corpus. The overt form of the
language items was arbitrary, in the sense that
the form of the lexical items contained no
information about their lexical category.
However, the behavior of the lexical items,
defined in terms of cooccurrence restrictions,
reflected their membership in implicit
classes and subclasses. The question
was whether or not the network could
induce these classes.


Network  Architecture 

Time is an important element in language,
and so the question of how to represent
serially ordered inputs is crucial. Various
proposals have been advanced (for reviews,
see Elman, in press; Mozer, 1988).
The approach taken here involves treating
the network as a simple dynamical system
in which previous states are made available
as an additional input (Jordan, 1986). In
Jordan’s  work  the  prior  state  was  derived 
from  the  output  units  on  the  previous  time 
cycle.  In  the  work  here,  the  prior  state 
comes  from  the  hidden  unit  patterns  on  the 
previous  cycle.  Because  the  hidden  units  are 
not  taught  to  assume  specific  values,  this 
means  that  they  can  develop  representa¬ 
tions,  in  the  course  of  learning  a  task,  which 
encode  the  temporal  structure  of  the  task. 
In other words, the hidden units learn to become
a kind of memory which is very task-specific.

The  type  of  network  used  in  the  first 
simulation is shown in Figure 1. This network
is basically a 3-layer network with the
customary feed-forward connections from input
units to hidden units, and from hidden
units to output units. There is an additional
set of units, called context units, which
provide for limited recurrence (and so this may
be called a simple recurrent network).
These  context  units  are  activated  on  a  one- 
for-one  basis  by  the  hidden  units,  with  a 
fixed  weight  of  1.0. 

The  result  is  that  at  each  time  cycle 
the  hidden  unit  activations  are  copied  into 
the  context  units;  on  the  next  time  cycle,  the 
context  combines  with  the  new  input  to  acti¬ 
vate  the  hidden  units.  The  hidden  units 
therefore  take  on  the  job  of  mapping  new  in¬ 
puts  and  prior  states  to  the  output.  Because 
they  themselves  constitute  the  prior  state, 
they  must  develop  representations  which  fa¬ 
cilitate  this  input/output  mapping.  The  sim¬ 
ple  recurrent  network  has  been  studied  in  a 
number  of  tasks  (Elman,  in  press;  Hare, 
Corina,  &  Cottrell,  1988;  Servan-Schreiber, 
Cleeremans,  &  McClelland,  1988).  In  this 
first  simulation,  there  were  31  input  units, 
150 hidden and context units, and 31 output
units. 
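
The copy-back mechanism described above can be sketched in a few lines. This is a minimal illustration of the forward pass only, not the code used for the simulations: the logistic activation, the small random initial weights, the zero initial context, and all names (`SimpleRecurrentNetwork`, `step`) are assumptions introduced here. Only the layer sizes (31 input, 150 hidden and context, 31 output) and the one-for-one copy with weight 1.0 come from the text.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class SimpleRecurrentNetwork:
    """Forward pass of a simple recurrent network (untrained, random weights)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            # Small random weights; the range is an assumption.
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.w_in = matrix(n_hidden, n_in)            # input -> hidden
        self.w_context = matrix(n_hidden, n_hidden)   # context -> hidden
        self.w_out = matrix(n_out, n_hidden)          # hidden -> output
        self.context = [0.0] * n_hidden               # assumed starting state

    def step(self, x):
        """Process one input vector; return (hidden, output)."""
        hidden = []
        for w_x, w_c in zip(self.w_in, self.w_context):
            net = sum(w * v for w, v in zip(w_x, x))
            net += sum(w * v for w, v in zip(w_c, self.context))
            hidden.append(sigmoid(net))
        output = [sigmoid(sum(w * h for w, h in zip(row, hidden)))
                  for row in self.w_out]
        # One-for-one copy of hidden activations into the context units
        # (equivalent to fixed connections of weight 1.0).
        self.context = list(hidden)
        return hidden, output


net = SimpleRecurrentNetwork(n_in=31, n_hidden=150, n_out=31)
word = [0.0] * 31
word[3] = 1.0                      # an arbitrary one-bit input vector
hidden_1, output_1 = net.step(word)
hidden_2, output_2 = net.step(word)  # same input, different prior state
```

Presenting the same input twice yields different hidden patterns, because the second presentation is combined with the context left by the first. This is the sense in which the hidden units encode both the current input and the prior state.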


Stimuli  and  Task 

A lexicon of 29 nouns and verbs was
chosen. Words were represented as 31-bit
binary vectors (two extra bits were reserved
for another purpose). Each word
was randomly assigned a unique vector in
which only one bit was turned on. A
sentence-generating program was then used to
create a corpus of 10,000 2- and 3-word
sentences. The sentences reflected certain
properties of the words. For example, only
animate nouns occurred as the subject of the
verb eat, and this verb was only followed
by edible substances. Finally, the words in
successive sentences were concatenated,
so that a stream of 27,354 vectors was created.
This formed the input set.
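
The encoding and concatenation steps can be illustrated as follows. This is a sketch under stated assumptions: the five-word lexicon and the two sentences are hypothetical stand-ins for the 29-word lexicon and the 10,000-sentence corpus, and the function names are introduced here for illustration.

```python
import random


def assign_one_hot(words, n_bits, seed=0):
    """Randomly assign each word a unique n_bits vector with exactly one bit on."""
    rng = random.Random(seed)
    positions = rng.sample(range(n_bits), len(words))  # distinct bit positions
    codes = {}
    for word, pos in zip(words, positions):
        vec = [0] * n_bits
        vec[pos] = 1
        codes[word] = vec
    return codes


def concatenate(sentences, codes):
    """Flatten sentences into one stream of vectors, with no sentence breaks."""
    return [codes[word] for sentence in sentences for word in sentence]


# Hypothetical subset of the lexicon and corpus.
lexicon = ["woman", "dragon", "cookie", "eat", "chase"]
codes = assign_one_hot(lexicon, n_bits=31)
sentences = [["woman", "eat", "cookie"], ["dragon", "chase", "woman"]]
stream = concatenate(sentences, codes)
```

Because every vector has a single active bit, any two word vectors are orthogonal, so the input form carries no information about category membership.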

The  task  was  simply  for  the  network 
to  take  successive  words  from  the  input 
stream  and  to  predict  the  subsequent  word 
(by  producing  it  on  the  output  layer).  After 
each  word  was  input,  the  output  was  com¬ 
pared  with  the  actual  next  word,  and  the 
backpropagation  of  error  learning  algorithm 
(Rumelhart,  Hinton,  &  Williams,  1986)  was 
used  to  adjust  the  network  weights.  Words 
were  presented  in  order,  with  no  breaks  be¬ 
tween  sentences.  The  network  was  trained 
on  6  passes  through  the  corpus. 

The  prediction  task  was  chosen  for 
several  reasons.  First,  it  makes  minimal 
assumptions about special knowledge required
for training. The teacher function is
simple and the information provided is
available in the world at the next moment in
time. Thus, there are no a priori theoretical
commitments which might bias the outcome.
Second, although the task is simple
and should not be taken as a model of
comprehension, it does seem to be the case that
much of what listeners do involves anticipation
of future input (Grosjean, 1980;
Marslen-Wilson & Tyler, 1980; Salasoo
& Pisoni, 1985).

Results

Because the sequence is non-deterministic,
short of memorizing the sequence,
the network cannot succeed in exact predictions.
That is, the underlying grammar and
lexical category structure provide a set of
constraints on the form of sentences, but the
sentences  themselves  involve  a  high  degree 
of  optionality.  Thus,  measuring  the  perfor¬ 
mance  of  the  network  in  this  simulation  is 
not  straightforward.  Root  mean  squared  er¬ 
ror  at  the  conclusion  of  training  had  dropped 
to  0.88.  However,  this  result  is  not  impres¬ 
sive.  When  output  vectors  are  sparse,  as 
those  used  in  this  simulation  were  (only  1 
out  of  31  output  bits  was  to  be  turned  on), 
the  network  quickly  learns  to  reduce  error 
dramatically  by  turning  all  the  output  units 
off.  This  drops  error  from  the  initial  random 
value  of  ~15.5  to  1.0,  which  is  close  to  the  fi¬ 
nal  rmse  value  of  0.88. 
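
The value of 1.0 for the all-off strategy can be checked directly. The sketch below assumes that the per-pattern error is the square root of the summed squared differences between output and target; the text does not spell out the formula, so this is an illustrative reading rather than the procedure actually used. It does not attempt to reproduce the initial value quoted above, which depends on how the error was aggregated.

```python
import math


def root_sum_squared_error(output, target):
    """Square root of the summed squared differences (assumed error measure)."""
    return math.sqrt(sum((o - t) ** 2 for o, t in zip(output, target)))


target = [0.0] * 31
target[7] = 1.0            # one-bit target; the position is arbitrary
all_off = [0.0] * 31       # network output with every unit turned off

error_all_off = root_sum_squared_error(all_off, target)
print(error_all_off)       # 1.0: only the single target bit is missed
```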

Although  the  prediction  task  is  non- 
deterministic,  it  is  also  true  that  word  order 
is  not  random  or  unconstrained.  For  any  giv¬ 
en  sequence  of  words  there  are  a  limited 
number  of  possible  successors.  Under  these 
circumstances,  it  would  seem  more  appropri¬ 
ate  to  ask  whether  or  not  the  network  has 
learned  what  the  class  of  valid  successors  is, 
at  each  point  in  time.  We  therefore  might  ex¬ 
pect  that  the  network  should  learn  to  acti¬ 
vate  the  output  nodes  to  some  value  propor¬ 
tional  to  the  probability  of  occurrence  of  each 
word  in  that  context. 

Therefore,  rather  than  evaluating  final 
network  performance  using  the  rms  error  cal¬ 
culated  by  comparing  the  network’s  output 
with  the  actual  next  word,  we  can  compare 
the  output  with  the  probability  of  occurrence 
of  possible  successors.  These  values  can  be 
derived  empirically  from  the  training  data 
base  (for  details  see  Elman,  in  press);  such 
calculation yields a "likelihood output vector"
which is appropriate for each input, and
which reflects the context-dependent expectations
given the training base (where context
is defined as extending from the beginning
of the sentence to the input). Note that
it  is  appropriate  to  use  these  likelihood  vec¬ 
tors  only  for  the  evaluation  phase.  Training 
must  be  done  on  the  actual  successor  words 
because  the  point  is  to  force  the  network  to 
learn  the  context-dependent  probabilities  for 
itself. 
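
One way to derive such likelihood vectors is to count, for every sentence prefix in the corpus, how often each word follows it. The sketch below is an assumption about how this could be done, not the procedure of Elman (in press). The toy corpus and vocabulary are hypothetical, and for simplicity the sketch ignores the transition from the last word of one sentence to the first word of the next.

```python
from collections import Counter, defaultdict


def likelihood_vectors(sentences, vocabulary):
    """Map each sentence prefix to the empirical distribution over next words."""
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = defaultdict(Counter)
    for sentence in sentences:
        for i in range(len(sentence) - 1):
            context = tuple(sentence[: i + 1])   # sentence start up to the input
            counts[context][sentence[i + 1]] += 1
    vectors = {}
    for context, successors in counts.items():
        total = sum(successors.values())
        vec = [0.0] * len(vocabulary)
        for word, n in successors.items():
            vec[index[word]] = n / total
        vectors[context] = vec
    return vectors


# Hypothetical toy corpus.
vocabulary = ["boy", "girl", "dog", "eat", "chase", "cookie", "bread"]
corpus = [
    ["boy", "eat", "cookie"],
    ["boy", "eat", "bread"],
    ["boy", "chase", "dog"],
    ["girl", "eat", "cookie"],
]
vectors = likelihood_vectors(corpus, vocabulary)
after_boy = vectors[("boy",)]            # eat: 2/3, chase: 1/3
after_boy_eat = vectors[("boy", "eat")]  # cookie: 1/2, bread: 1/2
```

These vectors serve only as an evaluation target; the network itself is still trained on the actual successor words.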

Evaluated  in  this  manner,  the  error  on 
the  training  set  is  0.053  (sd:  0.100).  The  co¬ 
sine  of  the  angle  between  output  vectors  and 
likelihood  vectors  provides  another  measure 
of  performance  (which  normalizes  for  length 
differences  in  the  vectors);  the  mean  cosine 
is  0.916  (sd:  0.123),  indicating  that  the  two 
vectors  on  average  have  very  similar 
shapes.  Objectively,  the  performance  ap¬ 
pears  to  be  quite  good. 
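
The cosine measure can be written out as below. This is a generic sketch of the standard formula; the example vectors are made up for illustration and are not taken from the simulation.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


likelihood = [0.0, 0.5, 0.5, 0.0]          # hypothetical likelihood vector
output = [0.0, 0.2, 0.2, 0.0]              # same shape, smaller magnitude
same_shape = cosine(output, likelihood)     # close to 1 despite length difference
orthogonal = cosine([1.0, 0.0, 0.0, 0.0], likelihood)
```

An output that has the right shape but a different overall magnitude still scores near 1, which is why the cosine complements the error measure.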

Lexical  categories 

The  question  to  be  asked  now  is  how 
this  performance  has  been  achieved.  One 
way  to  answer  this  is  to  see  what  sorts  of  in¬ 
ternal  representations  the  network  develops 
in  order  to  carry  out  the  prediction  task.  This 
is  particularly  relevant,  given  the  focus  of  the 
current  paper.  The  internal  representations 
are  instantiated  as  activation  patterns  across 
the hidden units which are evoked in response
to each word in its context. These
patterns were saved during a testing phase
during  which  no  learning  took  place.  For  each 
of  the  29  unique  words  a  mean  vector  was 
then  computed  which  averaged  across  all  oc¬ 
currences  of  the  word  in  various  contexts. 
These  mean  vectors  were  then  subjected  to 
hierarchical  clustering  analysis.  Figure  2 
shows  the  tree  constructed  from  the  hidden 
unit  patterns  for  the  29  lexical  items. 
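
The averaging and clustering steps can be sketched as follows. The text does not state which linkage criterion was used, so average linkage with Euclidean distance is an assumption here, as are the function names and the two-dimensional toy "hidden unit" vectors (the actual patterns have 150 dimensions).

```python
import math
from collections import defaultdict


def mean_vectors(tokens):
    """Average the hidden vectors of all occurrences of each word.

    tokens: list of (word, hidden_vector) pairs saved during testing.
    """
    sums, counts = {}, defaultdict(int)
    for word, vec in tokens:
        if word not in sums:
            sums[word] = [0.0] * len(vec)
        sums[word] = [s + v for s, v in zip(sums[word], vec)]
        counts[word] += 1
    return {w: [s / counts[w] for s in sums[w]] for w in sums}


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cluster(means):
    """Agglomerative clustering (average linkage); returns a nested-tuple tree."""
    clusters = [(word, [vec]) for word, vec in sorted(means.items())]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i][1] for b in clusters[j][1]]
                d = sum(euclidean(a, b) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical saved patterns.
tokens = [
    ("boy", [0.9, 0.1]), ("boy", [0.8, 0.2]), ("girl", [0.85, 0.1]),
    ("eat", [0.1, 0.9]), ("chase", [0.2, 0.8]),
]
tree = cluster(mean_vectors(tokens))
```

On this toy input the two noun-like items are merged first and the two verb-like items second, so the top-level split separates the two groups.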

The tree in Figure 2 shows the similarity
structure of the internal representations
of the 29 lexical items. The form of each item
is randomly assigned (and orthogonal to all
other items), and so the basis for the
similarity in the internal representations is the
way in which these words "behave" with regard
to the task.

Figure 2

The  network  has  discovered  that 
there  are  several  major  categories  of  words. 
One  large  category  corresponds  to  verbs;  an¬ 
other  category  corresponds  to  nouns.  The 
verb category is broken down into groups
which require a direct object; which are
intransitive; and for which a direct object is
optional. The noun category is divided into
major groups for animates and inanimates.
Animates are divided into human and non-human;
the non-humans are sub-divided into
large animals and small animals. Inanimates
are divided into breakables, edibles, and
miscellaneous.

This  category  structure  reflects  facts 
about  the  possible  sequential  ordering  of  the 
inputs. The network is not able to predict
the precise order of specific words, but it
recognizes that (in this corpus) there is a
class of inputs (viz., verbs) which typically
follow other inputs (viz., nouns). This
knowledge  of  class  behavior  is  quite  de¬ 
tailed;  from  the  fact  that  there  is  a  class  of 
items  which  always  precedes  chase, 
break,  and  smash,  it  infers  a  category  of 
large  animals  (or  possibly,  aggressors). 

Several  points  should  be  empha¬ 
sized.  First,  the  category  structure  ap¬ 
pears  to  be  hierarchical.  Dragons  are 
large  animals,  but  also  members  of  the 
class  [-human,  +animate]  nouns.  The  hier¬ 
archical  interpretation  is  achieved  through 
the  way  in  which  the  spatial  relations  of  the 
representations  are  organized.  Representa¬ 
tions  which  are  near  one  another  in  repre¬ 
sentational  space  form  classes,  and  higher- 
level  categories  correspond  to  larger  and 
more  general  regions  of  this  space. 

Second,  it  is  also  the  case  that  the 
hierarchicality  and  category  boundaries  are 
"soft".  This  does  not  prevent  categories 
from  being  qualitatively  distinct  by  being  far 
from  each  other  in  space  with  no  overlap. 
But  there  may  also  be  entities  which  share 
properties  of  otherwise  distinct  categories, 
so  that  in  some  cases  category  membership 
may  be  marginal  or  ambiguous. 

Finally,  the  content  of  the  categories 
is  not  known  to  the  network.  The  network 
has  no  information  available  which  would 
ground  the  structural  information  in  the  real 
world.  This  is  both  a  plus  and  a  minus.  Ob¬ 
viously,  a  full  account  of  language  process¬ 
ing  needs  to  provide  such  grounding.  On  the 
other  hand,  it  is  interesting  that  the  evi¬ 
dence  for  category  structure  can  be  inferred 
so  readily  on  the  basis  of  language-internal 
evidence  alone. 


Type-token  distinctions 

The tree shown in Figure 2 was constructed
from activation patterns averaged
across context. It is also possible to cluster
activation patterns evoked in response to
words in the various contexts in which they
occur. When the context-sensitive hidden
unit patterns are clustered, it is found that
the  large-scale  structure  of  the  tree  is  iden¬ 
tical  to  that  shown  in  Figure  2.  However, 
each  terminal  leaf  is  now  replaced  with  fur¬ 
ther  arborization  for  all  occurrences  of  the 
word  (there  are  no  instances  of  lexical  items 
appearing  on  inappropriate  branches). 

This  finding  bears  on  the  type/token 
problem  in  an  important  way.  In  this  simula¬ 
tion,  the  context  makes  up  an  important  part 
of  the  internal  representation  of  a  word.  In¬ 
deed,  it  is  somewhat  misleading  to  speak  of 
the  hidden  unit  representations  as  word  rep¬ 
resentations  in  the  conventional  sense, 
since  these  patterns  also  reflect  the  prior 
context.  As  a  result,  it  is  literally  the  case 
that  every  occurrence  of  a  lexical  item  has  a 
separate internal representation. We cannot
point to a canonical representation for
John; instead there are representations for
John_1, John_2, ..., John_n. These are the tokens
of John, and the fact that they are different
is the way the system marks what
may be subtle but important meaning differences
associated with the specific token.
The  fact  that  these  are  all  tokens  of  the 
same  type  is  not  lost,  however.  These  to¬ 
kens  have  representations  which  are  ex¬ 
tremely  close  in  space  —  closer  to  each  oth¬ 
er  by  far  than  to  any  other  entity.  Even 
more  interesting  is  that  the  spatial  organiza¬ 
tion  within  the  token  space  is  not  random 
but  reflects  differences  in  context  which  are 
also  found  among  tokens  of  other  items. 
The  tokens  of  boy  which  occur  in  subject 
position  tend  to  cluster  together,  as  distinct 
from tokens of boy which occur in object position.
This distinction is marked in the
same way for tokens of other nouns. Thus,
the  network  has  learned  not  only  about  types 
and  tokens,  and  categories  and  category 
members;  it  also  has  learned  a  grammatical 
role  distinction  which  cuts  across  lexical 
items. 

This  simulation  has  involved  a  task  in 
which  the  category  structure  of  inputs  was 
an  important  determinant  of  their  behavior. 
The  category  structure  was  apparent  in  their 
behavior  only;  their  external  form  provided  no 
useful  information.  We  have  seen  that  the 
network  makes  use  of  spatial  organization  in 
order  to  capture  this  category  structure. 

We  turn  next  to  a  problem  in  which 
the  lexical  category  structure  provides  only 
one  part  of  the  solution,  and  in  which  the  net¬ 
work  must  learn  abstract  grammatical  struc¬ 
ture. 


REPRESENTATION OF GRAMMATICAL STRUCTURE

In  the  previous  simulation  there  was 
little  interesting  structure  of  the  sort  that  re¬ 
lated  words  to  one  another.  Most  of  the  rel¬ 
evant  information  regarding  sequential  be¬ 
havior  was  encoded  in  terms  of  invariant 
properties  of  items.  Although  lexical  infor¬ 
mation  plays  an  important  role  in  language,  it 
actually  accounts  for  only  a  small  range  of 
facts.  Words  are  processed  in  the  contexts 
of  other  words;  they  inherit  properties  from 
the  specific  grammatical  structure  in  which 
they occur. This structure can be quite complex,
and it is not clear that the kind of category
structure supported by the spatial distribution
of representations is sufficient to capture
the structure which belongs, not to
individual  words,  but  to  particular  configura¬ 
tions  of  words. 

As  we  consider  this  issue,  we  also 
note that till now we have neglected an important
dimension along which structure may
be manifest: time. The clustering technique
used in the previous simulation informs us
of the similarity relations along spatial
dimensions. The technique tells us nothing
about the patterns of movement through
space. This is unfortunate, since the networks
we are using are dynamical systems
whose states change over time. Clustering
groups states according to the metric of Euclidean
distance but in so doing discards the
information  about  whatever  temporal  rela¬ 
tions  may  hold  between  states.  This  infor¬ 
mation  is  clearly  relevant  if  we  are  con¬ 
cerned  about  grammatical  structure.  Con¬ 
sider  the  sentences 

(1a) The man saw the car.

(1b) The man who saw the car called the cops.

On  the  basis  of  the  results  of  the  previous 
simulation,  we  would  expect  that  the  repre¬ 
sentations  for  the  word  car  in  these  two 
sentences  would  be  extremely  similar.  Not 
only  are  they  the  same  lexical  type,  but  they 
both  appear  in  clause-final  position  as  the 
object  of  the  same  verb. 

But  we  might  also  wish  to  have  their 
representations capture an important structural
difference between them. Car in sentence
(1a) occurs at the end of the sentence;
it brings us to a state from which we
should move into another class of states
that are associated with the onsets of new
sentences. In sentence (1b), car is also at
the end of a clause, but occurs in a matrix
sentence which has not yet been completed.
There are grammatical obligations
which  remain  unfulfilled.  We  would  like  the 
state  that  is  associated  with  car  in  this 
context  to  lead  us  to  the  class  of  states 
which  might  conclude  the  main  clause. 

The issue of how to understand the
temporal structure of state trajectories will
thus figure importantly in our attempts to
understand the representation of grammatical
structure.


Stimuli  and  Task 

The  stimuli  in  this  simulation  were 
based on a lexicon of 23 items. These included
8 nouns, 12 verbs, the relative pronoun
who, and an end-of-sentence indicator,
".". Each item was represented by a
randomly  assigned  26-bit  vector  in  which  a 
single  bit  was  set  to  1  (3  bits  were  reserved 
for  another  purpose).  A  phrase  structure 
grammar,  shown  in  Table  1,  was  used  to 
generate  sentences.  The  resulting  sentenc¬ 
es  possessed  certain  important  properties. 
These  include  the  following. 


(a)  Agreement 

Subject nouns agree with their verbs.
Thus, for example, (2a) is grammatical but
not (2b) (the training corpus consisted of
positive examples only; thus the starred examples
below did not occur).

(2a)  John  feeds  dogs. 

(2b)  *Boys  sees  Mary. 

Words are not marked for number
(singular/plural), form class (verb/noun, etc.),
or grammatical role (subject/object, etc.).
The network must learn first that there are
items which function as what we would call
nouns, verbs, etc.; then it must learn which
items are examples of singular and plural;
and then it must learn which nouns are
subjects and which are objects (since
agreement only holds between subject
nouns and their verbs).

S -> NP VP "."
NP -> PropN | N | N RC
VP -> V (NP)
RC -> who NP VP | who VP (NP)
N -> boy | girl | cat | dog | boys | girls | cats | dogs
PropN -> John | Mary
V -> chase | feed | see | hear | walk | live | chases |
     feeds | sees | hears | walks | lives

Additional restrictions:

• number agreement between N & V within clause, and
  (where appropriate) between head N & subordinate V

• verb arguments:

  chase, feed -> require a direct object
  see, hear -> optionally allow a direct object
  walk, live -> preclude a direct object
  (observed also for head/verb relations in relative
  clauses)

Table 1
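
A sentence generator for the grammar in Table 1 might look like the following sketch. It is not the program used to build the training corpora: the branching probabilities, the depth limit that guarantees termination, the treatment of proper nouns as singular, and all function names are assumptions made for illustration.

```python
import random

NOUNS = {"sg": ["boy", "girl", "cat", "dog"],
         "pl": ["boys", "girls", "cats", "dogs"]}
PROPER = ["John", "Mary"]                      # assumed singular
VERBS = {"chase": "required", "feed": "required",
         "see": "optional", "hear": "optional",
         "walk": "precluded", "live": "precluded"}


def inflect(stem, number):
    """Singular subjects take the -s form of the verb."""
    return stem + "s" if number == "sg" else stem


def noun_phrase(rng, depth, p_rc):
    """NP -> PropN | N | N RC. Returns (words, number)."""
    if rng.random() < 0.2:
        return [rng.choice(PROPER)], "sg"
    number = rng.choice(["sg", "pl"])
    words = [rng.choice(NOUNS[number])]
    if depth > 0 and rng.random() < p_rc:
        words += relative_clause(rng, number, depth - 1, p_rc)
    return words, number


def verb_phrase(rng, number, depth, p_rc, object_filled=False):
    """VP -> V (NP); the verb agrees with `number`."""
    stems = [s for s, c in VERBS.items()
             if not (object_filled and c == "precluded")]
    stem = rng.choice(stems)
    words = [inflect(stem, number)]
    wants_object = VERBS[stem] == "required" or (
        VERBS[stem] == "optional" and rng.random() < 0.5)
    if wants_object and not object_filled:
        words += noun_phrase(rng, depth, p_rc)[0]
    return words


def relative_clause(rng, head_number, depth, p_rc):
    """RC -> who NP VP | who VP (NP)."""
    if rng.random() < 0.5:
        # The head noun is the object of the embedded verb, so no object follows.
        subject, number = noun_phrase(rng, depth, p_rc)
        return ["who"] + subject + verb_phrase(
            rng, number, depth, p_rc, object_filled=True)
    # The head noun is the subject; the embedded verb agrees with it.
    return ["who"] + verb_phrase(rng, head_number, depth, p_rc)


def sentence(rng, depth=2, p_rc=0.3):
    """S -> NP VP "." """
    subject, number = noun_phrase(rng, depth, p_rc)
    return subject + verb_phrase(rng, number, depth, p_rc) + ["."]


rng = random.Random(0)
print(" ".join(sentence(rng)))
```

Setting `p_rc=0.0` yields only simple sentences, which is one way a corpus without relative clauses could be produced.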

(b)  Verb  argument  structure 

Verbs  fall  into  three  classes:  those 
that  require  direct  objects,  those  that  permit 
an  optional  direct  object,  and  those  that  pre¬ 
clude  direct  objects.  As  a  result,  sentences 
(3a-d)  are  grammatical,  whereas  sentences 
(3e,  3f)  are  ungrammatical. 

(3a) Girls feed dogs. (D.o. required)

(3b) Girls see boys. (D.o. optional)

(3c) Girls see. (D.o. optional)

(3d) Girls live. (D.o. precluded)

(3e)  *Girls  feed. 

(3f)  *Girls  live  dogs. 


Again, the type of verb is not overtly
marked in the input, and so the class
membership needs to be inferred at the
same time as the cooccurrence facts are
learned.

(c) Interactions with relative clauses

Both the agreement and the verb argument
facts are complicated in relative
clauses. While direct objects normally follow
the verb in simple sentences, some relative
clauses have the direct object as the
head of the clause, in which case the network
must learn to recognize that the direct
object has already been filled (even though
it occurs before the verb). Thus, the normal
pattern in simple sentences (3a-d) appears
also in (4a), but contrasts with (4b).

(4a) Dog sees girl.

(4b) Dog chases

Sentence (4c), which seems to conform to the
pattern established in (3), is ungrammatical.

(4c) *Dog chases dog

Similar complications arise for the
agreement facts. In simple sentences
agreement involves N1 - V1. In complex
sentences, such as (5a), that regularity is
violated, and any straightforward attempt to
generalize it to sentences with multiple
clauses would lead to the ungrammatical (5b).

(5a) Dog who boys feed

(5b) *Dog who see girl.


(d)  Recursion 

The  grammar  permits  recursion 
through the presence of relative clauses
(which expand to noun phrases which may
introduce yet other relative clauses, etc.). This
leads  to  sentences  such  as  (6)  in  which  the 
grammatical  phenomena  noted  in  (a-c)  may 
be  extended  over  a  considerable  distance. 


(6) Boys who dogs chase see hear.


(e)  Viable  sentences 

One of the literals inserted by the grammar is
".", which occurs at the end of sentences.
This end-of-sentence marker can of course
potentially occur anywhere in a string where
a sentence is viable (in the sense that it is
grammatically well-formed and may at that
point be terminated). Thus in sentence (7),
the arrows indicate positions where a "."
might legally occur.

(7)  Boys  see  dogs  who  see  girls  who  hear. 

↑  ↑  ↑ 

*  *  * 

The data in (4-7) are examples of the sorts of phenomena which
linguists argue cannot be accounted for without abstract
representations; it is these representations rather than the surface
strings on which the correct grammatical generalizations are made.

A network of the form shown in Figure 3 was trained on the prediction
task (layers are shown as rectangles; numbers indicate the number of
nodes in each layer).


Figure 3

(Network diagram not reproduced. Legible labels: OUTPUT, 26 nodes;
intermediate layer, 10 nodes; INPUT, 26 nodes.)
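To make the prediction architecture concrete, the following is a minimal sketch of one time step of a simple recurrent network. The layer sizes (26 input and output units, 70 hidden units) come from the text; everything else is an assumption made for illustration: the one-hot word coding, the logistic activation, the random untrained weights, the omission of the 10-unit intermediate layers, and the names `init_weights`, `srn_step`, and `run_sequence`.

```python
import numpy as np

N_WORDS, N_HIDDEN = 26, 70  # sizes mentioned in the text


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_weights(rng, scale=0.1):
    # Random, untrained weights (assumption; training is not shown here).
    return {
        "in": rng.normal(0.0, scale, (N_HIDDEN, N_WORDS)),
        "ctx": rng.normal(0.0, scale, (N_HIDDEN, N_HIDDEN)),
        "out": rng.normal(0.0, scale, (N_WORDS, N_HIDDEN)),
    }


def srn_step(weights, word_index, context):
    """Process one word; return (prediction for next word, new hidden state)."""
    x = np.zeros(N_WORDS)
    x[word_index] = 1.0  # one-hot input coding (assumption)
    # The hidden state depends on the current word AND the previous hidden state.
    hidden = sigmoid(weights["in"] @ x + weights["ctx"] @ context)
    prediction = sigmoid(weights["out"] @ hidden)
    return prediction, hidden


def run_sequence(weights, word_indices):
    """Return the hidden state after each word of a concatenated input stream."""
    context = np.zeros(N_HIDDEN)
    states = []
    for idx in word_indices:
        _, context = srn_step(weights, idx, context)
        states.append(context)
    return np.array(states)
```

Because the previous hidden state is fed back as context, the same word presented at two different positions generally produces two different hidden states, which is the property the later analyses rely on.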


The  training  data  were  generated 
from  the  phrase  structure  grammar  given  in 
Table  1.  At  any  given  point  during  training, 
the  training  set  consisted  of  10,000  sentenc¬ 
es  which  were  presented  to  the  network  5 
times.  (As  before,  sentences  were  concate¬ 
nated  so  that  the  input  stream  proceeded 
smoothly  without  breaks  between  sentenc¬ 
es.)  However,  the  composition  of  these 
sentences  varied  over  time.  The  following 
training  regimen  was  used  in  order  to  pro¬ 
vide  for  incremental  training.  The  network 
was  trained  on  5  passes  through  each  of  the 
following  4  corpora. 

Phase  1:  The  first  training  set  con¬ 
sisted  exclusively  of  simple  sentences. 
This  was  accomplished  by  eliminating  all 
relative  clauses.  The  result  was  a  corpus  of 
34,605  words  forming  10,000  sentences 
(each  sentence  includes  the  terminal  "."). 

Phase 2: The network was then exposed to a second corpus of 10,000
sentences which consisted of 25% complex sentences and 75% simple
sentences (complex sentences were obtained by permitting relative
clauses). Mean sentence length was 3.92 (minimum: 3 words, maximum: 13
words).

Phase  3:  The  third  corpus  increased 
the  percentage  of  complex  sentences  to 
50%,  with  mean  sentence  length  of  4.38 
(minimum:  3  words,  maximum:  13  words). 

Phase  4:  The  fourth  consisted  of 
10,000  sentences,  75%  complex,  25%  sim¬ 
ple.  Mean  sentence  length  was  6.02 
(minimum:  3  words,  maximum:  16  words). 

This staged learning strategy was developed in response to results of
earlier pilot work. In this work, it was found that the network was
unable to learn the task when given the full range of complex data from
the beginning of training. However, when the network was permitted to
focus on the simpler data first, it was able to learn the task quickly
and then move on successfully to more complex patterns. The important
aspect of this was that the earlier training constrained later learning
in a useful way; the early training forced the network to focus on
canonical versions of the problems, which apparently created a good
basis for then solving the more difficult forms of the same problems.
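The staged regimen can be summarized in a short sketch. The proportions of complex sentences, the corpus size, and the number of passes are taken from the text; the function names, the exact (rather than probabilistic) split between simple and complex sentences, and the caller-supplied generators `make_simple` and `make_complex` are assumptions introduced for illustration, since the grammar of Table 1 is not reproduced here.

```python
import random

# Fraction of complex sentences in each of the four phases (from the text).
PHASE_COMPLEX_FRACTION = [0.0, 0.25, 0.50, 0.75]
SENTENCES_PER_CORPUS = 10_000
PASSES_PER_CORPUS = 5


def build_corpus(frac_complex, make_simple, make_complex, rng,
                 n=SENTENCES_PER_CORPUS):
    """Build one corpus with the requested share of complex sentences."""
    n_complex = round(n * frac_complex)
    corpus = [make_complex(rng) for _ in range(n_complex)]
    corpus += [make_simple(rng) for _ in range(n - n_complex)]
    rng.shuffle(corpus)
    return corpus


def staged_training(train_on_stream, make_simple, make_complex, seed=0):
    """Run the four phases in order, five passes through each corpus."""
    rng = random.Random(seed)
    for frac in PHASE_COMPLEX_FRACTION:
        corpus = build_corpus(frac, make_simple, make_complex, rng)
        # Sentences are concatenated so the input stream has no breaks.
        stream = [word for sentence in corpus for word in sentence]
        for _ in range(PASSES_PER_CORPUS):
            train_on_stream(stream)
```

Here `train_on_stream` stands for whatever weight-update procedure is used; the sketch fixes only the order and composition of the training data.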

Results 

At  the  conclusion  of  the  fourth  phase 
of  training,  the  weights  were  frozen  at  their 
final  values  and  network  performance  was 
tested  on  a  novel  set  of  data,  generated  in 
the  same  way  as  the  last  training  corpus. 
The technique described in the previous simulation was used;
context-dependent likelihood vectors were generated for each word in
every sentence. These vectors represented the empirically derived
probabilities of occurrence for all possible predictions, given the
sentence context up to that point.
The  rms  error  of  network  predictions,  com¬ 
pared  against  the  likelihood  vectors,  was 
0.177  (sd:  0.463);  the  mean  cosine  of  the 
angle  between  the  vectors  was  0.852 
(sd:  0.259).  Although  this  performance  is 
not  as  good  as  in  the  previous  simulation,  it 
is  still  quite  good.  And  the  task  is  obvious¬ 
ly  much  more  difficult. 
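The evaluation just described can be sketched as follows. The code estimates, for every sentence-initial context in a test corpus, the empirical probability of each possible next word, and then compares a prediction vector with that likelihood vector by rms error and by cosine. The function names are hypothetical, and the rms normalization (mean over output units) is an assumption, since this section does not state the exact formula used.

```python
from collections import Counter, defaultdict

import numpy as np


def likelihood_vectors(sentences, vocab):
    """Map each context (tuple of preceding words) to next-word probabilities."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = defaultdict(Counter)
    for sentence in sentences:
        for t in range(1, len(sentence)):
            counts[tuple(sentence[:t])][sentence[t]] += 1
    vectors = {}
    for context, successors in counts.items():
        total = sum(successors.values())
        vec = np.zeros(len(vocab))
        for word, count in successors.items():
            vec[index[word]] = count / total
        vectors[context] = vec
    return vectors


def rms_error(prediction, target):
    # Root of the mean squared difference over output units (assumed form).
    return float(np.sqrt(np.mean((prediction - target) ** 2)))


def cosine(prediction, target):
    # Cosine of the angle between the two vectors; 1.0 means same direction.
    norm = np.linalg.norm(prediction) * np.linalg.norm(target)
    return float(prediction @ target / norm)
```

Comparing against likelihood vectors rather than the actual next word is what makes the error interpretable: the task is nondeterministic, so even a perfect predictor cannot know which of several legal successors will occur.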

These  gross  measures  of  perfor¬ 
mance  however  do  not  tell  us  how  well  the 
network  has  done  in  each  of  the  specific 
problem  areas  posed  by  the  task.  Let  us 
look  at  each  area  in  turn. 


(a)  Agreement  in  simple  sentences 

Agreement in simple sentences is shown in Figures 4a and 4b.

The network's predictions following the word boy are that either a
singular verb will follow (words in all three singular verb categories
are activated, since it has no basis for predicting the type of verb),
or else that the next word may be the relative pronoun who. Conversely,
when the input is the word boys, the expectation is that a verb in the
plural will follow, or else the relative pronoun. Similar expectations
hold for the other nouns in the lexicon.

Figure 4

(a) Graph of network predictions following presentation of the word
boy. Predictions are shown as activations for words grouped by
category. S stands for end-of-sentence ("."); W stands for who; N and V
represent nouns and verbs; 1 and 2 indicate singular or plural; and
type of verb is indicated by N, R, O (direct object not possible,
required, or optional). (b) Graph of network predictions following
presentation of the word boys.

(b)  Verb  argument  structure  in  simple  sentences 

Figure  5  shows  network  predictions 
following  an  initial  noun  and  then  a  verb 
from  each  of  the  three  different  verb  types. 

When the verb is lives, the network's expectation is that the following
item will be "." (which is in fact the only successor permitted by the
grammar in this context). The verb sees, on the other hand, may either
be followed by a "." or optionally by a direct object (which may be a
singular or plural noun, or proper noun). Finally, the verb chases
requires a direct object, and the network learns to expect a noun
following this and other verbs in the same class.


(c)  Interactions  with  relative  clauses 

The examples so far have all involved simple sentences. The agreement
and verb argument facts are more complicated in complex sentences.
Figure 6 shows the network predictions for each word in the sentence
boys who mary chases feed cats. If the network were generalizing the
pattern for agreement found in the simple sentences, we might expect
the network to predict a singular verb following ...mary chases...
(insofar as it predicts a verb in this position at all; conversely, it
might be confused by the pattern N1 N2 V1). But in fact, the prediction
(6d) is correctly that the next verb should be in the plural in order
to agree with the first noun. In so doing, it has found some mechanism
for representing the long-distance dependency between the main clause
noun and main clause verb, despite the presence of an intervening noun
and verb (with their own agreement relations) in the relative clause.

Note that this sentence also illustrates the sensitivity to an
interaction between verb argument structure and relative clause
structure. The verb chases takes an obligatory direct object. In simple
sentences the direct object follows the verb immediately; this is also
true in many complex sentences (e.g., boys who chase mary feed cats).
In the sentence displayed, however, the direct object (boys) is the
head of the relative clause and appears before the verb. This requires
that the network learn (a) there are items which function as nouns,
verbs, etc.; (b) which items fall into which classes; (c) there are
subclasses of verbs which have different cooccurrence relations with
nouns, corresponding to verb-direct object restrictions; (d) which
verbs fall into which classes; and (e) when to expect that the direct
object will follow the verb, and when to know that it has already
appeared. The network appears to have learned this, because in panel
(d) we see that it expects that chases will be followed by a verb (the
main clause verb, in this case) rather than a noun.

Figure 5

An even subtler point is demonstrated in (6c). The appearance of boys
followed by a relative clause containing a different subject (who
Mary...) primes the network to expect that the verb which follows must
be of the class that requires a direct object, precisely because a
direct object filler has already appeared. In other words, the network
correctly responds to the presence of a filler (boys) not only by
knowing where to expect a gap (following chases); it also learns that
when this filler corresponds to the object position in the relative
clause, a verb is required which has the appropriate argument
structure.


Network  analysis 

The  natural  question  to  ask  at  this 
point  is  how  the  network  has  learned  to  ac¬ 
complish  the  task.  It  was  initially  assumed 
that  success  on  this  task  would  constitute 
prima  facie  evidence  for  the  existence  of  in¬ 
ternal  representations  which  possessed  ab¬ 
stract  structure.  That  is,  it  seemed  reason¬ 
able  to  believe  that  in  order  to  handle  agree¬ 
ment  and  argument  structure  facts  in  the 
presence  of  relative  clauses,  the  network 
would  be  required  to  develop  representa¬ 
tions  which  reflected  constituent  structure, 
argument  structure,  grammatical  category, 
grammatical  relations,  and  number. 

Having  achieved  success  on  the 
task,  we  now  would  like  to  test  this  as¬ 
sumption.  In  the  previous  simulation,  hier¬ 
archical  clustering  was  used  to  reveal  the 
use  of  spatial  organization  at  the  hidden  unit 
level  for  categorization  purposes.  However, 
the  clustering  technique  makes  it  difficult  to 
see  patterns  which  exist  over  time.  Some 
states  may  have  significance  not  simply  in 
terms  of  their  similarity  to  other  states,  but 
with  regard  to  the  ways  in  which  they  con¬ 
strain  movement  into  subsequent  state 
space  (recall  the  examples  in  (1)).  Because 
clustering  ignores  the  temporal  information, 
it hides this information. What would be more useful would be to look
at the trajectories through state space over time which correspond to
the internal representations evoked at the hidden unit layer as a
network processes a given sentence.

Phase-state  portraits  of  this  sort  are 
commonly  limited  to  displaying  not  more 
than  a  few  state  variables  at  once,  simply 
because  movement  in  more  than  three  di¬ 
mensions  is  difficult  to  graph.  The  hidden 
unit  activation  patterns  in  the  current  simu¬ 
lation  take  place  over  70  variables.  These 
patterns  are  distributed,  in  the  sense  that 
none  of  the  hidden  units  alone  provides  use¬ 
ful  information;  the  information  instead  lies 
along  hyperplanes  which  cut  across  multiple 
units. 

However, it is possible to identify these hyperplanes using principal
component analysis. This involves passing the training set through the
trained network (with weights frozen) and saving the hidden unit
pattern produced in response to each new input. The covariance matrix
of the set of hidden unit vectors is calculated, and then the
eigenvectors of the covariance matrix are found. The eigenvectors are
ordered by the magnitude of their eigenvalues, and are used as the new
basis for describing the original hidden unit vectors. This new set of
dimensions has the effect of giving a somewhat more localized
description to the hidden unit patterns, because the new dimensions now
correspond to the location of meaningful activity (defined in terms of
variance) in the hyperspace. Furthermore, since the dimensions are
ordered in terms of variance accounted for, we can now look at
phase-state portraits of selected dimensions, starting with those with
largest eigenvalues.
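The procedure just described (covariance matrix, eigenvectors ordered by eigenvalue, re-description of the hidden unit vectors in the new basis) can be sketched with NumPy. This is an illustrative reconstruction under stated assumptions, not the original analysis code: the function names are hypothetical, and the hidden unit patterns are assumed to be stored as rows of an array, one row per input word.

```python
import numpy as np


def principal_components(hidden_states):
    """hidden_states: array of shape (n_patterns, n_hidden_units)."""
    mean = hidden_states.mean(axis=0)
    # Covariance matrix of the hidden unit vectors (columns are variables).
    cov = np.cov(hidden_states - mean, rowvar=False)
    # eigh handles symmetric matrices; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # reorder: largest eigenvalue first
    return mean, eigvals[order], eigvecs[:, order]


def project(hidden_states, mean, eigvecs, components):
    """Coordinates of each hidden state along the selected components."""
    return (hidden_states - mean) @ eigvecs[:, components]
```

A trajectory for one sentence is then the sequence of rows returned by `project` for the hidden states recorded after each word, for example with `components=[1]` for the second principal component or `components=[0, 2]` for a plane spanned by the first and third.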

Agreement 

The sentences in (8) were presented to the network, and the hidden unit
patterns captured after each word was processed in sequence.

(8a)  boys  hear  boys . 

(8b)  boy  hears  boys . 

(8c)  boy  who  boys  chase  chases  boy . 

(8d)  boys  who  boys  chase  chase  boy . 

(These sentences were chosen to minimize differences due to lexical
content and to make it possible to focus on differences due to
grammatical structure. (8a) and (8b) were contained in the training
data; (8c) and (8d) were novel and had never been presented to the
network during learning.)

By examining the trajectories through state space along various
dimensions, it was apparent that the second principal component played
an important role in marking number of the main clause subject. Figure
7 shows the trajectories for (8a) and (8b); the trajectories are
overlaid so that the differences are more readily seen. The paths are
similar and diverge only during the first word, indicating the
difference in the number of the initial noun. The difference is slight
and is eliminated after the main (i.e., second chase) verb has been
input. This is apparently because, for these two sentences (and for the
grammar), number information does not have any relevance for this task
once the main verb has been received.

Figure 7

Trajectories through state space for sentences (8a) and (8b). Each
point marks the position along the second principal component of hidden
unit space, after the indicated word has been input. Magnitude of the
second principal component is measured along the ordinate; time (i.e.,
order of word in sentence) is measured along the abscissa. In this and
subsequent graphs the sentence-final word is marked with a ]S.

It is not difficult to imagine sentences in which number information
may have to be retained over an intervening constituent; sentences (8c)
and (8d) are such examples. In both these sentences there is an
identical relative clause which follows the initial noun (which differs
with regard to number in the two sentences). This material, who boys
chase, is irrelevant as far as the agreement requirements for the main
clause verb are concerned. The trajectories through state space for
these two sentences have been overlaid and are shown in Figure 8; as
can be seen, the differences in the two trajectories are maintained
until the main clause verb is reached, at which point the states
converge.

Figure 8

Trajectories through state space for sentences (8c) and (8d).

Verb  argument  structure 

The  representation  of  verb  argument 
structure  was  examined  by  probing  with 
sentences  containing  instances  of  the  three 
different  classes  of  verbs.  Sample  sentenc¬ 
es  are  shown  in  (9). 

(9a)  boy  walks . 

(9b)  boy  sees  boy . 

(9c)  boy  chases  boy . 

The first of these contains a verb which may not take a direct object;
the second takes an optional direct object; and the third requires a
direct object. The movement through state space as these three
sentences are processed is shown in Figure 9.

This figure illustrates how the network encodes several aspects of
grammatical structure. Nouns are distinguished by role; subject nouns
for all three sentences appear in the upper right portion of the space,
and object nouns appear below them. (Principal component 4, not shown
here, encodes the distinction between verbs and nouns, collapsing
across case.) Verbs are differentiated with regard to their argument
structure. Chases requires a direct object, sees takes an optional
direct object, and walks precludes an object. The difference is
reflected in a systematic displacement in the plane of principal
components 1 and 3.

Figure 9

Trajectories through state space for sentences (9a), (9b), and (9c).
Principal component 1 is plotted along the abscissa; principal
component 3 is plotted along the ordinate.

Relative  clauses 

The  presence  of  relative  clauses  in¬ 
troduces  a  complication  into  the  grammar,  in 
that  the  representations  of  number  and  verb 
argument  structure  must  be  clause-specific. 
It  would  be  useful  for  the  network  to  have 
some  way  to  represent  the  constituent 
structure  of  sentences. 

The  trained  network  was  given  the 
following  sentences. 

(10a)  boy  chases  boy . 

(10b)  boy  chases  boy  who  chases  boy . 

(10c)  boy  who  chases  boy  chases  boy . 

(10d)  boy  chases  boy  who  chases  boy  who  chases  boy . 

The first sentence is simple; the other three are instances of embedded
sentences. Sentence (10a) was contained in the training data; sentences
(10b), (10c), and (10d) were novel and had not been presented to the
network during the learning phase.

The trajectories through state space for these four sentences
(principal components 1 and 11) are shown in Figure 10. Panel (10a)
shows the basic pattern associated with what is in fact the matrix
sentence for all four sentences. Comparison of this figure with panels
(10b) and (10c) shows that the trajectory for the matrix sentence
appears to follow the same form; the matrix subject noun is in the
lower left region of state space, the matrix verb appears above it and
to the left, and the matrix object noun is near the upper middle
region. (Recall that we are looking at only 2 of the 70 dimensions;
along other dimensions the noun/verb distinction is preserved
categorically.) The relative clause appears to involve a replication of
this basic pattern, but displaced toward the left and moved slightly
downward, relative to the matrix constituents. Moreover, the exact
position of the relative clause elements indicates which of the matrix
nouns is modified. Thus, the relative clause modifying the subject noun
is closer to it, and the relative clause modifying the object noun is
closer to it. This trajectory pattern was found for all sentences with
the same grammatical form; the pattern is thus systematic.

Figure 10

Movement through state space for sentences (10a-d). Principal component
1 is displayed along the abscissa; principal component 11 is displayed
along the ordinate.

Figure 10d shows what happens when there are multiple levels of
embedding. Successive embeddings are represented in a manner which is
similar to the way that the first embedded clause is distinguished from
the main clause; the basic pattern for the clause is replicated in a
region of state space which is displaced from the matrix material. This
displacement provides a systematic way for the network to encode the
depth of embedding in the current state. However, the reliability of
the encoding is limited by the precision with which states are
represented, which in turn depends on factors such as the number of
hidden units and the precision of the numerical values. In the current
simulation, the representation degraded after about three levels of
embedding. The consequences of this degradation on performance (in the
prediction task) are different for different types of sentences.
Sentences involving center embedding (e.g., 8c and 8d), in which the
level of embedding is crucial for maintaining correct agreement, are
more adversely affected than sentences involving so-called
tail-recursion (e.g., 10d). In these latter sentences the syntactic
structures in principle involve recursion, but in practice the level of
embedding is not relevant for the task (i.e., does not affect agreement
or verb argument structure in any way).

Figure  10d  is  interesting  in  another 
respect.  Given  the  nature  of  the  prediction 
task,  it  is  actually  not  necessary  for  the  net¬ 
work  to  carry  forward  any  information  from 
prior  clauses.  It  would  be  sufficient  for  the 
network  to  represent  each  successive  rela¬ 
tive  clause  as  an  iteration  of  the  previous 
pattern.  Yet  the  two  relative  clauses  are 
differentiated.  Similarly,  Servan-Schreiber, 
Cleeremans,  &  McClelland  (1988)  found 
that  when  a  simple  recurrent  network  was 
taught  to  predict  inputs  that  had  been  gener¬ 
ated  by  a  finite  state  automaton,  the  net¬ 
work  developed  internal  representations 
which  corresponded  to  the  FSA  states; 
however,  it  also  redundantly  made  finer- 
grained  distinctions  which  encoded  the  path 
by  which  the  state  had  been  achieved,  even 
though  this  information  was  not  used  for  the 
task.  It  thus  seems  to  be  a  property  of 
these  networks  that  while  they  are  able  to 
encode  state  in  a  way  which  minimizes  con¬ 
text  as  far  as  behavior  is  concerned,  their 
nonlinear  nature  allows  them  to  remain  sen¬ 
sitive  to  context  at  the  level  of  internal  rep¬ 
resentation. 

Part  II:  Discussion 


The basic question addressed in this paper is whether or not
connectionist models are capable of complex representations which
possess internal structure and which are productively extensible. This
question is of particular interest with regard to a more general issue:
How useful is the connectionist paradigm as a framework for cognitive
models? In this context, the nature of representations interacts with a
number of other closely related issues. So in order to understand the
significance of the present results, it may be useful first to consider
briefly two of these other issues. The first is the status of rules
(whether they exist, whether they are explicit or implicit); the second
is the notion of computational power (whether it is sufficient, whether
it is appropriate).

It  is  sometimes  suggested  that  con- 
nectionist  models  differ  from  Classical  mod¬ 
els  in  that  the  latter  rely  on  rules  whereas 
connectionist  models  are  typically  not  rule 
systems.  Although  at  first  glance  this  ap¬ 
pears  to  be  a  reasonable  distinction,  it  is  not 
actually  clear  that  the  distinction  gets  us 
very  far. 

The basic problem is that it is not obvious what is meant by a rule. In
the most general sense, a rule is a mapping which takes an input and
yields an output. Clearly, since many (although not all) neural
networks function as input/output systems in which the bulk of the
machinery implements some transformation, it is difficult to see how
they could not be thought of as rule-systems.

But perhaps what is meant is that the form of the rules differs in
Classical models and connectionist networks? One suggestion has been
that rules are stated explicitly in the former, whereas they are only
implicit in networks. This is a slippery issue, and there is an
unfortunate ambiguity in what is meant by implicit or explicit.

One sense of explicit is that a rule is physically present in the
system in its form as a rule, and furthermore, that that physical
presence is important to the correct functioning of the system.
However, Kirsh (1989) points out that our intuitions as to what counts
as physical presence are highly unreliable and sometimes contradictory.
What seems to really be at stake is the speed with which information
can be made available. If this is true, and Kirsh argues the point
persuasively, then the quality of explicitness does not belong to data
structures alone. One must also take into account the nature of the
processing system involved, since information in the same form may be
easily accessible in one processing system and inaccessible in another.

Unfortunately, our understanding of the information processing capacity
of neural networks is quite preliminary. There is a strong tendency in
analyzing such networks to view them through traditional lenses. We
suppose that if information is not contained in the same form as in
more familiar computational systems, that information is somehow
buried, inaccessible, and implicit. For instance, a network may
successfully learn some complicated mapping, say, from text to
pronunciation (Sejnowski & Rosenberg, 1987), but on inspecting the
resulting network, it is not immediately obvious how to explain how the
mapping works or even to characterize what the mapping is in any
precise way. In such cases, it is tempting to say that the network has
learned an implicit set of rules. But what we really mean is just that
the mapping is "complicated", "difficult to formulate", or "unknown".
In fact, this may be a description of our own failure to understand the
mechanism rather than a description of the mechanism itself. What is
needed are new techniques for network analysis, such as the principal
component analysis used in the present work, contribution analysis
(Sanger, 1989), weight matrix decomposition (McMillan & Smolensky,
1988), or skeletonization (Mozer & Smolensky, 1989).

If successful, these analyses of connectionist networks may provide us
with a new vocabulary for understanding information processing. We may
learn new ways in which information can be explicit or implicit, and we
may learn new notations for expressing the rules that underlie
cognition. The notation of these new connectionist rules may look very
different from that used in, for example, production rules. And we may
expect that the notation will not lend itself to describing all types
of regularity with equal facility.

Thus, the potentially important difference between connectionist models
and Classical models will not be in whether one or the other system
contains rules, or whether one system encodes information explicitly
and the other encodes it implicitly; the difference will lie in the
nature of the rules, and in what kinds of information count as
explicitly present.

This  potential  difference  brings  us  to 
the  second  issue:  computational  power. 
The  issue  divides  into  two  considerations. 
Do  connectionist  models  provide  sufficient 
computational  power  (to  account  for  cogni¬ 
tive  phenomena);  and  do  they  provide  the 
appropriate  sort  of  computational  power? 

The first question can be answered affirmatively, with an important
qualification. It can be shown that multilayer feedforward networks
with as few as one hidden layer, with no squashing at the output and an
arbitrary nonlinear activation function at the hidden layer, are
capable of arbitrarily accurate approximation of arbitrary mappings.
They thus belong to a class of universal approximators (Hornik,
Stinchcombe, & White, in press; Stinchcombe & White, 1989). Put
simplistically, they are effectively Turing machines. In principle,
then, such networks are capable of implementing any function that the
Classical system can implement.

The important qualification to the above result is that sufficiently
many hidden units be provided. What is not currently known is the
effect of limited resources on computational power. Since human
cognition is carried out in a system with relatively fixed and limited
resources, this question is of paramount interest. These limitations
provide critical constraints on the nature of the functions which can
be mapped; it is an important empirical question whether these
constraints explain the specific form of human cognition.

It is in this context that the question of the appropriateness of the
computational power becomes interesting. Given limited resources, it is
relevant to ask whether the kinds of operations and representations
which are naturally made available are those which are likely to figure
in human cognition. If one has a theory of cognition which requires
sorting of randomly ordered information, e.g., word frequency lists in
Forster's (1979) model of lexical access, then it becomes extremely
important that the computational framework provide efficient support
for the sort operation. On the other hand, if one believes that
information is stored associatively, then the ability of the system to
do a fast sort is irrelevant. Instead, it is important that the model
provide for associative storage and retrieval.* Of course, things work
in both directions. The availability of certain types of operations may
encourage one to build models of a type which are impractical in other
frameworks. And the need to work with an inappropriate computational
mechanism may blind us to things as they really are.

Let  us  return  now  to  the  current 
work.  I  would  like  to  discuss  first  some  of 
the  ways  in  which  the  work  is  preliminary 
and  limited.  Then  I  will  discuss  what  I  see 
as  the  positive  contributions  of  the  work. 
Finally,  I  would  like  to  relate  this  work  to
other  connectionist  research  and  to  the  gen¬ 
eral  question  raised  at  the  outset  of  this  dis¬ 
cussion:  How  viable  are  connectionist  mod¬ 
els  for  understanding  cognition? 

*This  example  was  suggested  to  me  by  Don
Norman. 

The  results  are  preliminary  in  a  num¬
ber  of  ways.  First,  one  can  imagine  a  num¬
ber  of  additional  tests  that  could  be  per¬ 
formed  to  test  the  representational  capacity 
of  the  simple  recurrent  network.  The  memo¬ 
ry  capacity  remains  unprobed  (but 
see  Servan-Schreiber,  Cleeremans,  & 
McClelland,  1988).  Generalization  has 
been  tested  in  a  limited  way  (many  of  the 
tests  involved  novel  sentences),  but  one
would  like  to  know  whether  the  network  can 
inferentially  extend  what  it  knows  about  the 
types  of  noun  phrases  encountered  in  the 
second  simulation  (simple  nouns  and  rela¬ 
tive  clauses)  to  noun  phrases  with  different 
structures. 

Second,  while  it  is  true  that  the 
agreement  and  verb  argument  structure 
facts  contained  in  the  present  grammar  are 
important  and  challenging,  we  have  barely
scratched  the  surface  in  terms  of  the  rich¬ 
ness  of  linguistic  phenomena  which  charac¬ 
terize  natural  languages. 

Third,  natural  languages  not  only 
contain  far  more  complexity  with  regard  to 
their  syntactic  structure,  they  also  have  a 
semantic  aspect.  Indeed,  Langacker  (1987) 
and  others  have  argued  persuasively  that  it 
is  not  fruitful  to  consider  syntax  and  se¬
mantics  as  autonomous  aspects  of  lan¬
guage.  Rather,  the  form  and  meaning  of  lan¬
guage  are  closely  entwined.  Although  there
may  be  things  which  can  be  learned  by 
studying  artificial  languages  such  as  the 
present  one  which  are  purely  syntactic,  nat¬ 
ural  language  processing  is  crucially  an  at¬ 
tempt  to  retrieve  meaning  from  linguistic 
form.  The  present  work  does  not  address 
this  issue  at  all,  but  there  are  other  PDP
models  which  have  made  progress  on  this 
problem  (e.g.,  St.  John  &  McClelland,  in 
press). 

What  the  current  work  does  contrib¬ 
ute  is  some  notion  of  the  representational 
capacity  of  connectionist  models.  Various
writers  (e.g.,  Fodor  &  Pylyshyn,  1988)  have
expressed  concern  regarding  the  ability  of 
connectionist  representations  to  encode 
compositional  structure  and  to  provide  for 
open-ended  generative  capacity.  The  net¬ 
works  used  in  the  simulations  reported  here 
have  two  important  properties  which  are 
relevant  to  these  concerns. 

First,  the  networks  make  possible 
the  development  of  internal  representations 
which  are  distributed  (Hinton,  1988;  Hinton, 
McClelland,  &  Rumelhart,  1986).  While  not
unbounded,  distributed  representations  are
less  rigidly  coupled  with  resources  than  lo¬
calist  representations,  in  which  there  is  a
strict  mapping  between  concept  and  individ¬
ual  nodes.  There  is  also  greater  flexibility
in  determining  the  dimensions  of  importance 
for  the  model. 

Second,  the  networks  studied  here 
build  in  a  sensitivity  to  context.  The  impor¬
tant  result  of  the  current  work  is  to  suggest
that  the  sensitivity  to  context  which  is  char¬ 
acteristic  of  many  connectionist  models,  and 
which  is  built  into  the  architecture  of  the
networks  used  here,  does  not  preclude  the 
ability  to  capture  generalizations  which  are 
at  a  high  level  of  abstraction.  Nor  is  this  a 
paradox.  Sensitivity  to  context  is  precisely 
the  mechanism  which  underlies  the  ability  to 
abstract  and  generalize.  The  fact  that  the 
networks  here  exhibited  behavior  which 
was  highly  regular  was  not  because  they 
learned  to  be  context-insensitive.  Rather, 
they  learned  to  respond  to  contexts  which 
are  more  abstractly  defined.  Recall  that 
even  when  these  networks’  behavior  seems 
to  ignore  context  (e.g.,  Figure  10d;  and
Servan-Schreiber,  Cleeremans,  & 
McClelland,  1988),  the  internal  representa¬ 
tions  reveal  that  contextual  information  is 
still  retained. 

This  behavior  is  in  striking  contrast 
to  that  of  most  Classical  models.  Represen¬ 
tations  in  Classical  models  are  naturally 
context-insensitive.  This  insensitivity
makes  it  possible  to  express  generaliza¬ 
tions  which  are  fully  regular  at  the  highest 
possible  level  of  representation  (e.g.,  purely 
syntactic),  but  they  require  additional  appa¬ 
ratus  to  account  for  regularities  which  reflect 
the  interaction  of  meaning  with  form  and 
which  are  more  contextually  defined.  Con¬
nectionist  models,  on  the  other  hand,  begin
the  task  of  abstraction  at  the  other  end  of 
the  continuum.  They  emphasize  the  impor¬ 
tance  of  context  and  the  interaction  of  form 
with  meaning.  As  the  current  work  demon¬
strates,  these  characteristics  lead  quite  nat¬
urally  to  generalizations  at  a  high  level  of  ab¬
straction  where  appropriate,  but  the  behav¬
ior  remains  ever-rooted  in  representations
which  are  contextually  grounded.  The  simu¬ 
lations  reported  here  do  not  capitalize  on 
subtle  distinctions  in  context,  but  there  are 
ample  demonstrations  of  models  which  do 
(e.g.,  Kawamoto,  1988;  McClelland  & 
Kawamoto,  1986;  Miikkulainen  &  Dyer, 
1989;  St.  John  &  McClelland,  in  press). 

Finally,  I  wish  to  point  out  that  the 
current  approach  suggests  a  novel  way  of 
thinking  about  how  mental  representations 
are  constructed  from  language  input. 

Conventional  wisdom  holds  that  as 
words  are  heard,  listeners  retrieve  lexical 
representations.  Although  these  represen¬ 
tations  may  indicate  the  contexts  in  which 
the  words  acceptably  occur,  the  represen¬ 
tations  are  themselves  context-free.  They 
exist  in  some  canonical  form  which  is  con¬ 
stant  across  all  occurrences.  These  lexical 
forms  are  then  used  to  assist  in  constructing 
a  complex  representation  into  which  the 
forms  are  inserted.  One  can  imagine  that 
when  complete,  the  result  is  an  elaborate 
structure  in  which  not  only  are  the  words 
visible,  but  which  also  depicts  the  abstract 
grammatical  structure  which  binds  those 
words. 

In  this  account,  the  process  of  build¬
ing  mental  structures  is  not  unlike  the  pro¬
cess  of  building  any  other  physical  structure,
such  as  bridges  or  houses.  Words  (and 
whatever  other  representational  elements 
are  involved)  play  the  role  of  building 
blocks.  As  is  true  of  bridges  and  houses, 
the  building  blocks  are  themselves  unaffect¬ 
ed  by  the  process  of  construction. 

A  different  image  is  suggested  in 
the  approach  taken  here.  As  words  are  pro¬ 
cessed  there  is  no  separate  stage  of  lexical 
retrieval.  There  are  no  representations  of 
words  in  isolation.  The  representations  of 
words  (the  internal  states  following  input  of 
a  word)  always  reflect  the  input  taken  to¬ 
gether  with  the  prior  state.  In  this  scenar¬ 
io,  words  are  not  building  blocks  as  much  as 
they  are  cues  which  guide  the  network 
through  different  grammatical  states. 
Words  are  distinct  from  each  other  by  virtue 
of  having  different  causal  properties. 
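
As a rough illustration of this point, the following sketch implements a single update step of a simple recurrent network in plain Python. The weights are random and untrained, the four-word vocabulary and layer sizes are invented for the example, and the names (`srn_step`, `run`, `W_in`, `W_ctx`) are hypothetical; this is not the network used in the simulations. It is meant only to show that the internal state following a word depends on the prior state as well as on the word itself.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix (assumption: untrained weights, for illustration only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def srn_step(x, h_prev, W_in, W_ctx):
    """One step: the new hidden state combines the current input with the
    prior hidden state, which plays the role of the context units."""
    h = []
    for i in range(len(W_in)):
        net = sum(w * xi for w, xi in zip(W_in[i], x))
        net += sum(w * hi for w, hi in zip(W_ctx[i], h_prev))
        h.append(1.0 / (1.0 + math.exp(-net)))  # logistic activation
    return h


def run(words, vocab, W_in, W_ctx, n_hidden):
    """Feed a word sequence and return the hidden state after the last word."""
    h = [0.5] * n_hidden  # assumed neutral starting state
    for word in words:
        x = [1.0 if v == word else 0.0 for v in vocab]  # one-hot input
        h = srn_step(x, h, W_in, W_ctx)
    return h


rng = random.Random(0)
vocab = ["boy", "dog", "who", "chases"]  # hypothetical toy vocabulary
n_hidden = 4
W_in = make_matrix(n_hidden, len(vocab), rng)
W_ctx = make_matrix(n_hidden, n_hidden, rng)

# The same final word, "chases", arrives in two different contexts.
state_a = run(["boy", "chases"], vocab, W_in, W_ctx, n_hidden)
state_b = run(["dog", "who", "chases"], vocab, W_in, W_ctx, n_hidden)
print(state_a)
print(state_b)
```

The two printed vectors differ even though the last input is identical, because each state also carries a trace of what preceded it. In a trained network those differences would be structured by the grammar rather than arbitrary.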

A  metaphor  which  captures  some  of 
the  characteristics  of  this  approach  is  the 
combination  lock.  In  this  metaphor,  the  role 
of  words  is  analogous  to  the  role  played  by 
the  numbers  in  the  combination.  The  num¬ 
bers  have  causal  properties;  they  advance 
the  lock  into  different  states.  The  effect  of  a 
number  is  dependent  on  its  context.  En¬ 
tered  in  the  correct  sequence,  the  numbers 
move  the  lock  into  an  open  state.  The  open 
state  may  be  said  to  be  functionally  compo¬ 
sitional  (van  Gelder,  in  press)  in  the  sense 
that  it  reflects  a  particular  sequence  of 
events.  The  numbers  are  "present"  insofar 
as  they  are  responsible  for  the  final  state, 
but  not  because  they  are  still  physically 
present. 

The  limitation  of  the  combination  lock 
is  of  course  that  there  is  only  one  correct 
combination.  The  networks  studied  here 
are  more  complex.  The  causal  properties  of 
the  words  are  highly  structure-dependent 
and  the  networks  allow  many  "open"  (i.e., 
grammatical)  states. 
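
The metaphor and its limitation can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the combination `(7, 3, 7)`, the four-entry transition table, and the state names are hypothetical, and a discrete table of this kind is far simpler than the continuous state space of the networks. The sketch shows only that an input's effect depends on the current state, and that the final state reflects the sequence without storing it.

```python
# Part 1: a combination lock. The state is just a count of correct entries.
COMBINATION = (7, 3, 7)  # hypothetical combination


def advance_lock(state, number):
    """Return the next lock state; a wrong number resets the lock to 0."""
    if state < len(COMBINATION) and number == COMBINATION[state]:
        return state + 1
    return 0


def run_lock(numbers):
    """Enter a sequence of numbers and return the final lock state."""
    state = 0
    for number in numbers:
        state = advance_lock(state, number)
    return state


# Part 2: a toy grammar with more than one "open" state.
TRANSITIONS = {
    ("start", "boy"): "singular_subject",
    ("start", "boys"): "plural_subject",
    ("singular_subject", "chases"): "singular_sentence",
    ("plural_subject", "chase"): "plural_sentence",
}
GRAMMATICAL_STATES = {"singular_sentence", "plural_sentence"}


def run_grammar(words):
    """Let each word move the system to a new state; unknown moves are rejected."""
    state = "start"
    for word in words:
        state = TRANSITIONS.get((state, word), "rejected")
    return state


lock_is_open = run_lock([7, 3, 7]) == len(COMBINATION)
sentence_ok = run_grammar(["boys", "chase"]) in GRAMMATICAL_STATES
sentence_bad = run_grammar(["boys", "chases"]) in GRAMMATICAL_STATES
print(lock_is_open, sentence_ok, sentence_bad)
```

In the lock, the number 7 advances the state from 0 and from 2 but resets it from 1, so its effect is context-dependent. In the toy grammar, "chases" leads to a grammatical state only after a singular subject, and there are two distinct accepting states rather than one.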

This  view  of  language  comprehen¬
sion  emphasizes  the  functional  importance
of  representations  and  is  similar  in  spirit  to 
the  approach  described  in  Bates  & 
MacWhinney,  1982;  McClelland,  St.  John,  &
Taraban,  1989;  and  many  others  who  have 
stressed  the  functional  nature  of  language. 
Representations  of  language  are  construct¬ 
ed  in  order  to  accomplish  some  behavior 
(where,  obviously,  that  behavior  may  range 
from  day-dreaming  to  verbal  duels,  and 
from  asking  directions  to  composing  poet¬
ry).  The  representations  are  not  proposi¬
tional,  and  their  information  content  changes
constantly  over  time  in  accord  with  the  de¬ 
mands  of  the  current  task.  Words  serve  as 
guideposts  which  help  establish  mental 
states  that  support  this  behavior;  represen¬ 
tations  are  snapshots  of  those  mental 
states. 